Abstract



A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to 
segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present 
for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of 
prosody (information gleaned from the timing and melody of speech) for these tasks. Using decision tree and hidden 
Markov modeling techniques, we combine prosodic cues with word-based approaches, and evaluate performance 
on two speech corpora, Broadcast News and Switchboard. Results show that the prosodic model alone performs 
on par with, or better than, word-based statistical language models — for both true and automatically recognized 
words in news speech. The prosodic model achieves comparable performance with significantly less training data, 
and requires no hand-labeling of prosodic events. Across tasks and corpora, we obtain a significant improvement 
over word-only models using a probabilistic combination of prosodic and lexical information. Inspection reveals 
that the prosodic models capture language-independent boundary indicators described in the literature. Finally, cue 
usage is task and corpus dependent. For example, pause and pitch features are highly informative for segmenting 
news speech, whereas pause, duration and word-based cues dominate for natural conversation. 
Summary (translated from German)



A key step in speech processing for information extraction, topic classification, or playback is
segmentation into topic and sentence units. Speech segmentation is difficult because the cues usually
found in text for this purpose (headings, paragraphs, punctuation) are missing in spoken language. We
investigate the use of prosody (the timing and melody of speech) for this purpose. Using decision trees
and hidden Markov models, we combine prosodic and word-based information, and test our methods on two
speech corpora, Broadcast News and Switchboard. For both correct and automatically recognized word
transcriptions of Broadcast News, our results show that prosodic models alone perform as well as or
better than word-based statistical language models. The prosodic model achieves comparable performance
with considerably less training data and requires no manual transcription of prosodic properties. For
both segmentation types and corpora, we obtain a significant improvement over purely word-based models
by probabilistically combining prosodic and lexical information sources. An examination of the prosodic
models shows that they respond to language-independent segmentation cues described in the literature.
The choice of features depends substantially on segmentation type and corpus. For example, pauses and
F0 features are informative above all for news speech, whereas duration- and word-based features
dominate in natural conversations.
Summary (translated from French)



A crucial step in speech processing for information extraction, detection of the topic of conversation,
and browsing is the segmentation of discourse. This is difficult because the cues that help segment a
text (headers, paragraphs, punctuation) do not appear in spoken language. We study the use of prosody
(information extracted from the rhythm and melody of speech) for this purpose. Using decision trees and
hidden Markov chains, we combine prosodic cues with the language model. We evaluate our algorithm on
two corpora, Broadcast News and Switchboard. Our results indicate that the prosodic model is equivalent
or superior to the language model, and that it requires less training data. It requires no manual
annotation of prosody. Moreover, we obtain a significant gain by probabilistically combining prosodic
and lexical information, across different corpora and applications. A more detailed inspection of the
results reveals that the prosodic models identify the indicators of segment beginnings and ends
described in the literature. Finally, the use of prosodic cues depends on the application and the
corpus. For example, pitch proves extremely useful for segmenting television news bulletins, whereas
duration features and features extracted from the language model are more useful for segmenting
natural conversations.
1 Introduction 

1.1 Why process audio data?

Extracting information from audio data allows examination
of a much wider range of data sources than does 
text alone. Many sources (e.g., interviews, conversa- 
tions, news broadcasts) are available only in audio 
form. Furthermore, audio data is often a much richer 
source than text alone, especially if the data was orig- 
inally meant to be heard rather than read (e.g., news 
broadcasts). 

1.2 Why automatic segmentation? 

Past automatic information extraction systems have 
depended mostly on lexical information for segmen- 
tation (Kubala et al., 1998; Allan et al., 1998; Hearst, 
1997; Kozima, 1993; Yamron et al., 1998, among oth- 
ers). A problem for the text-based approach, when 
applied to speech input, is the lack of typographic 
cues (such as headers, paragraphs, sentence punctuation,
and capitalization) in continuous speech. 

A crucial step toward robust information extrac- 
tion from speech is the automatic determination of 
topic, sentence, and phrase boundaries. Such loca- 
tions are overt in text (via punctuation, capitalization, 
formatting) but are absent or "hidden" in speech out- 
put. Topic boundaries are an important prerequisite 
for topic detection, topic tracking, and summarization. 
They are further helpful for constraining other tasks 
such as coreference resolution (e.g., since anaphoric 
references do not cross topic boundaries). Finding 
sentence boundaries is a necessary first step for topic 
segmentation. It is also necessary to break up long 
stretches of audio data prior to parsing. In addition, 
modeling of sentence boundaries can benefit named 
entity extraction from automatic speech recognition 
(ASR) output, for example by preventing proper nouns 
spanning a sentence boundary from being grouped to- 
gether. 

1.3 Why use prosody? 

When spoken language is converted via ASR to a 
simple stream of words, the timing and pitch patterns 
are lost. Such patterns (and other related aspects that 
are independent of the words) are known as speech 
prosody. In all languages, prosody is used to convey 
structural, semantic, and functional information. 

Prosodic cues are known to be relevant to dis- 
course structure across languages (e.g., Vaissiere, 
1983) and can therefore be expected to play an im- 
portant role in various information extraction tasks. 
Analyses of read or spontaneous monologues in lin- 
guistics and related fields have shown that informa- 
tion units, such as sentences and paragraphs, are of- 
ten demarcated prosodically. In English and related 
languages, such prosodic indicators include paus- 
ing, changes in pitch range and amplitude, global 
pitch declination, melody and boundary tone dis- 
tribution, and speaking rate variation. For exam- 
ple, both sentence boundaries and paragraph or topic 
boundaries are often marked by some combination 
of a long pause, a preceding final low boundary 
tone, and a pitch range reset, among other features 
(Lehiste, 1979, 1980; Brown et al., 1980; Bruce, 
1982; Thorsen, 1985; Silverman, 1987; Grosz and 
Hirschberg, 1992; Sluijter and Terken, 1994; Swerts 
and Geluykens, 1994; Koopmans-van Beinum and van 
Donzel, 1996; Hirschberg and Nakatani, 1996; Naka- 
jima and Tsukada, 1997; Swerts, 1997; Swerts and 
Ostendorf, 1997). 

Furthermore, prosodic cues by their nature are rel- 
atively unaffected by word identity, and should there- 
fore improve the robustness of lexical information ex- 
traction methods based on ASR output. This may be 
particularly important for spontaneous human-human 
conversation since ASR word error rates remain much 
higher for these corpora than for read, constrained, or 
computer-directed speech (National Institute for Stan- 
dards and Technology, 1999). 

A related reason to use prosodic information is 
that certain prosodic features can be computed even 
when ASR is unavailable, for example, for a new 
language for which no dictionary exists. In that case 
they could be applied, for instance, to audio browsing 
and playback, or to cut waveforms prior to recognition 
so that audio segments are limited to durations 
feasible for decoding. 

Furthermore, unlike spectral features, some 
prosodic features (e.g., duration and intonation pat- 
terns) are largely invariant to changes in channel char- 
acteristics (to the extent that they can be adequately 
extracted from the signal). Thus, the research results 
are independent of characteristics of the communication 
channel, implying that the benefits of prosody are 
significant across multiple applications. 

Finally, prosodic feature extraction can be 
achieved with minimal additional computational load 
and no additional training data; results can be inte- 
grated directly with existing conventional ASR lan- 
guage and acoustic models. Thus, performance gains 
can be evaluated quickly and cheaply, without requir- 
ing additional infrastructure. 

1.4 This study 

Past studies involving prosodic information have gen- 
erally relied on hand-coded cues (an exception is 
Hirschberg and Nakatani, 1996). We believe the 
present work to be the first that combines fully au- 
tomatic extraction of both lexical and prosodic infor- 
mation for speech segmentation. Our general frame- 
work for combining lexical and prosodic cues for tag- 
ging speech with various kinds of hidden structural 
information is a further development of earlier work 
on detecting sentence boundaries and disfluencies in 
spontaneous speech (Shriberg et al., 1997; Stolcke et 
al., 1998; Hakkani-Tür et al., 1999; Stolcke et al., 
1999; Tür et al., 2000) and on detecting topic boundaries 
in Broadcast News (Hakkani-Tür et al., 1999; 
Stolcke et al., 1999; Tür et al., 2000). In previous 
work we provided only a high-level summary of the 
prosody modeling, focusing instead on detailing the 
language modeling and model combination. 

In this paper we describe the prosodic modeling 
in detail. In addition we include, for the first time, 
controlled comparisons for speech data from two cor- 
pora differing greatly in style: Broadcast News (Graff, 
1997) and Switchboard (Godfrey et al., 1992). The 
two corpora are compared directly on the task of 
sentence segmentation, and the two tasks (sentence 
and topic segmentation) are compared for the Broad- 
cast News data. Throughout, our paradigm holds 
the candidate features for prosodic modeling constant 
across tasks and corpora. That is, we created paral- 
lel prosodic databases for both corpora, and used the 
same machine learning approach for prosodic model- 
ing in all cases. We look at results for both true words, 
and words as hypothesized by a speech recognizer. 
Both conditions provide informative data points. True 
words reflect the inherent additional value of prosodic 
information above and beyond perfect word information. 
Using recognized words allows comparison of 
degradation of the prosodic model to that of a language 
model, and also allows us to assess realistic 
performance of the prosodic model when word bound- 
ary information must be extracted based on incorrect 
hypotheses rather than forced alignments. 

Section 2 describes the methodology, including 
the prosodic modeling using decision trees, the language 
modeling, the model combination approaches, 
and the data sets. The prosodic modeling section 
is particularly detailed, outlining the motivation for 
each of the prosodic features and specifying their extraction, 
computation, and normalization. Section 3 
discusses results for each of our three tasks: sentence 
segmentation for Broadcast News, sentence segmen- 
tation for Switchboard, and topic segmentation for 
Broadcast News. For each task, we examine results 
from combining the prosodic information with lan- 
guage model information, using both transcribed and 
recognized words. We focus on overall performance, 
and on analysis of which prosodic features prove most 
useful for each task. The section closes with a gen- 
eral discussion of cross-task comparisons, and issues 
for further work. Finally, in Section 4 we summarize 
the main insights gained from the study, concluding 
with points on the general relevance of prosody for 
automatic segmentation of spoken audio. 

2 Method 

2.1 Prosodic modeling 

2.1.1 Feature extraction regions 

In all cases we used only very local features, for prac- 
tical reasons (simplicity, computational constraints, 
extension to other tasks), although in principle one 
could look at longer regions. As shown in Fig. 1, 
for each inter-word boundary, we looked at prosodic 
features of the word immediately preceding and fol- 
lowing the boundary, or alternatively within a window 
of 20 frames (200 ms, a value empirically optimized 
for this work) before and after the boundary. In bound- 
aries containing a pause, the window extended back- 
ward from the pause start, and forward from the pause 
end. (Of course, it is conceivable that a more effective 
region could be based on information about syllables 
and stress patterns, for example, extending backward 
and forward until a stressed syllable is reached. However, 
the recognizer used did not model stress, so we 
preferred the simpler, word-based criterion used here.) 

[Fig. 1: Feature extraction regions for each inter-word boundary. 
Example utterance: "after a powerful earthquake hit last night (pause) 
at eleven we bring you live coverage". Labels: SENTENCE BOUNDARY; 
200 ms windows.] 
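
As a rough illustration of the windowing just described, the following Python sketch derives the pre- and post-boundary regions from word alignments. The `Word` record, the function name, and the clamping at time zero are assumptions made for illustration; this is not the extraction code used in the study.

```python
from dataclasses import dataclass
from typing import List, Tuple

WINDOW_S = 0.200  # 200 ms, i.e. 20 frames if frames are 10 ms apart


@dataclass
class Word:
    text: str
    start: float  # seconds, from the recognizer alignment
    end: float    # seconds


Region = Tuple[float, float]


def boundary_windows(words: List[Word],
                     window: float = WINDOW_S) -> List[Tuple[Region, Region]]:
    """Return (pre_window, post_window) for each inter-word boundary.

    The pre-boundary window extends backward from the end of the preceding
    word (the pause start, if a pause is present); the post-boundary window
    extends forward from the start of the following word (the pause end).
    """
    regions = []
    for prev, nxt in zip(words, words[1:]):
        pre = (max(0.0, prev.end - window), prev.end)
        post = (nxt.start, nxt.start + window)
        regions.append((pre, post))
    return regions
```

With this layout a pause between two words is simply skipped: neither window overlaps it, which matches the description of windows anchored at the pause start and pause end.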

We extracted prosodic features reflecting pause 
durations, phone durations, pitch information, and 
voice quality information. Pause features were extracted 
at the inter-word boundaries. Duration, F0, 
and voice quality features were extracted mainly from 
the word or window preceding the boundary (which 
was found to carry more prosodic information for 
these tasks than the speech following the boundary; 
Shriberg et al., 1997). We also included pitch-related 
features reflecting the difference in pitch range across 
the boundary. 

In addition, we included nonprosodic features that 
are inherently related to the prosodic features, for ex- 
ample, features that make a prosodic feature undefined 
(such as speaker turn boundaries) or that would show 
up if we had not normalized appropriately (such as 
gender, in the case of F0). This allowed us both to 
better understand feature interactions, and to check 
for appropriateness of normalization schemes. 

We chose not to use amplitude- or energy-based 
features, since previous work showed these features to 
be both less reliable than and largely redundant with 
duration and pitch features. A main reason for the 
lack of robustness of the energy cues was the high 
degree of channel variability in both corpora exam- 
ined, even after application of various normalization 
techniques based on the signal-to-noise ratio distri- 
bution characteristics of, for example, a conversation 
side (the speech recorded from one speaker in the 
two-party conversation) in Switchboard. Exploratory 
work showed that energy measures can correlate with 
shows (news programs in the Broadcast News corpus), 
speakers, and so forth, rather than with the structural 
locations in which we were interested. Duration and 
pitch, on the other hand, are relatively invariant to 
channel effects (to the extent that they can be ade- 
quately extracted). 

In training, word boundaries were obtained from 
recognizer forced alignments. In testing on recognized 
words, we used alignments for the 1-best recognition 
hypothesis. Note that this creates a mismatch 
between training and test data when testing on 
recognized words, and the mismatch works against us. 
That is, the prosodic models are trained on better 
alignments than can be expected in testing; thus, the 
features selected may be suboptimal in the less robust 
situation of recognized words. Therefore, we expect that any 
benefit from the present, suboptimal approach would 
only be enhanced if the prosodic models were based 
on recognizer alignments in training as well. 

2.1.2 Features 

We included features that, based on the descriptive lit- 
erature, should reflect breaks in the temporal and into- 
national contour. We developed versions of such fea- 
tures that could be defined at each inter-word bound- 
ary, and that could be extracted by completely auto- 
matic means, without human labeling. Furthermore, 
the features were designed to be independent of word 
identities, for robustness to imperfect recognizer out- 
put. 

We began with a set of over 100 features, which, 
after initial investigations, was pared down to a smaller 
set by eliminating features that were clearly not at 
all useful (based on decision tree experiments; see 
also Section 2.1.4). The resulting set of features is 
described below. Features are grouped into broad 
feature classes based on the kinds of measurements 
involved, and the type of prosodic behavior they were 
designed to capture. 

2.1.2.1 Pause features. Important cues to bound- 
aries between semantic units, such as sentences or 
topics, are breaks in prosodic continuity, including 
pauses. We extracted pause duration at each bound- 
ary based on recognizer output. The pause model used 
by the recognizer was trained as an individual phone, 
which during training could occur optionally between 
words. In the case of no pause at the boundary, this 
pause duration feature was output as 0. 

We also included the duration of the pause preced- 
ing the word before the boundary, to reflect whether 
speech right before the boundary was just starting up 
or continuous from previous speech. Most inter-word 
locations contained no pause, and were labeled as zero 
length. We did not need to distinguish between actual 
pauses and the short segmental-related pauses (e.g., 
stop closures) inserted by the speech recognizer, since 
models easily learned to distinguish the cases based 
on duration. 
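
The two pause measures can be sketched in a few lines of Python. The tuple layout for aligned words and the feature names are assumptions for illustration; the sketch treats any gap between consecutive word alignments as a pause and outputs 0 when there is none, as described above.

```python
def pause_features(words):
    """Compute pause features for each inter-word boundary.

    words: list of (text, start_s, end_s) tuples in time order.
    Boundary i lies between words[i] and words[i + 1].
    """
    feats = []
    for i in range(len(words) - 1):
        # Pause at the boundary itself; 0 if the words are adjacent.
        pause = max(0.0, words[i + 1][1] - words[i][2])
        # Pause preceding the word before the boundary; undefined (None)
        # for the first word, which has no predecessor.
        if i > 0:
            prev_pause = max(0.0, words[i][1] - words[i - 1][2])
        else:
            prev_pause = None
        feats.append({"pause_dur": pause, "prev_pause_dur": prev_pause})
    return feats
```

No threshold separates true pauses from short segmental gaps here, in keeping with the observation that the models learned that distinction from duration alone.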

We investigated both raw durations and durations 
normalized for pause duration distributions from the 
particular speaker. Our models selected the unnor- 
malized feature over the normalized version, possibly 
because of a lack of sufficient pause data per speaker. 
The unnormalized measure was apparently sufficient 
to capture the gross differences in pause duration dis- 
tributions that separate boundary from nonboundary 
locations, despite speaker variation within both cate- 
gories. 

For the Broadcast News data, which contained 
mainly monologues and which was recorded on a sin- 
gle channel, pause durations were undefined at speaker 
changes. For the Switchboard data there was signif- 
icant speaker overlap, and a high rate of backchan- 
nels (such as "uh-huh") that were uttered by a lis- 
tener during the speaker's turn. Some of these cases 
were associated with simultaneous speaker pausing 
and listener backchanneling. Because the pauses here 
did not constitute real turn boundaries, and because 
the Switchboard conversations were recorded on sep- 
arate channels, we included such speaker pauses in the 
pause duration measure (i.e., even though a backchan- 
nel was uttered on the other channel). 



2.1.2.2 Phone and rhyme duration features. An- 
other well-known cue to boundaries in speech is a 
slowing down toward the ends of units, or prebound- 
ary lengthening. Preboundary lengthening typically 
affects the nucleus and coda of syllables, so we in- 
cluded measures here that reflected duration charac- 
teristics of the last rhyme (nucleus plus coda) of the 
syllable preceding the boundary. 

Each phone in the rhyme was normalized for inherent 
duration as follows: 

$$\sum_{i} \frac{\mathrm{phone\_dur}_i - \mathrm{mean\_phone\_dur}_i}{\mathrm{std\_dev\_phone\_dur}_i} \qquad (1)$$

where mean_phone_dur_i and std_dev_phone_dur_i 
are the mean and standard deviation of the current 
phone over all shows or conversations in the training 
data.[2] Rhyme features included the average normalized 
phone duration in the rhyme, computed by dividing 
the measure in Eq. (1) by the number of phones 
in the rhyme, as well as a variety of other methods 
for normalization. To roughly capture lengthening 
of prefinal syllables in a multisyllabic word, we also 
recorded the longest normalized phone, as well as the 
longest normalized vowel, found in the preboundary 
word.[3] 
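
A minimal Python sketch of these duration measures follows, assuming that phone labels, durations, and per-phone training statistics are already available. The function names, dictionary keys, and data layout are hypothetical and chosen only for illustration.

```python
def normalized_durations(phones, stats):
    """Z-normalize each phone duration by its phone-specific statistics.

    phones: list of (phone_label, duration_s)
    stats:  dict mapping phone_label -> (mean_s, std_dev_s) from training data
    """
    return [(dur - stats[label][0]) / stats[label][1] for label, dur in phones]


def rhyme_features(rhyme_phones, word_phones, stats):
    """Rhyme-level and word-level duration features for one preboundary word."""
    z_rhyme = normalized_durations(rhyme_phones, stats)
    z_word = normalized_durations(word_phones, stats)
    total = sum(z_rhyme)  # summed normalized duration over the rhyme
    return {
        "rhyme_norm_dur": total,
        "rhyme_norm_dur_per_phone": total / len(z_rhyme),
        "max_norm_phone_in_word": max(z_word),
    }
```

For a word such as "night" the rhyme would be the vowel plus the final stop, while the maximum is taken over all phones of the word.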

We distinguished phones in filled pauses (such as 
"um" and "uh") from those elsewhere, since it has been 
shown in previous work that durations of such fillers 
(which are very frequent in Switchboard) are consid- 
erably longer than those of spectrally similar vowels 
elsewhere (Shriberg, 1999). We also noted that for 
some phones, particularly nasals, errors in the rec- 
ognizer forced alignments in training sometimes pro- 
duced inordinately long (incorrect) phone durations. 
This affected the robustness of our standard deviation 
estimates; to avoid the problem we removed any clear 
outliers by inspecting the phone-specific duration his- 
tograms prior to computing standard deviations. 

[2] Improvements in future work could include the use of triphone-based 
normalization (on a sufficiently large corpus to assure robust 
estimates), or of normalization based on syllable position and stress 
information (given a dictionary marked for this information). 

[3] Using dictionary stress information would probably be a better 
approach. Nevertheless, one advantage of this simple method is 
a robustness to pronunciation variation, since the longest observed 
normalized phone duration is used, rather than some predetermined 
phone. 

In addition to using phone-specific means and 
standard deviations over all speakers in a corpus, 
we investigated the use of speaker-specific values 
for normalization, backing off to cross-speaker values 
for cases of low phone-by-speaker counts. However, 
these features were less useful than the features 
from data pooled over all speakers (probably due to 
a lack of robustness in estimating the standard deviations 
in the smaller, speaker-specific data sets). Alternative 
normalizations were also computed, including 
phone_dur_i / mean_phone_dur_i (to avoid noisy 
estimates of standard deviations), both for speaker-independent 
and speaker-dependent means. 

Interestingly, we found it necessary to bin the 
normalized duration measures in order to reflect pre- 
boundary lengthening, rather than segmental informa- 
tion. Because these duration measures were normal- 
ized by phone-specific values (means and standard 
deviations), our decision trees were able to use certain 
specific feature values as clues to word identities and, 
indirectly, to boundaries. For example, the word "I" 
in the Switchboard corpus is a strong cue to a sen- 
tence onset; normalizing by the constant mean and 
standard deviation for that particular vowel resulted in 
specific values that were "learned" by the models. To 
address this, we binned all duration features to remove 
the level of precision associated with the phone-level 
correlations. 
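
The binning step can be illustrated as follows. The text does not state the bin edges that were used, so the edges below are purely hypothetical; the point of the sketch is only that coarse bins discard the fine-grained values that could otherwise identify individual phones or words.

```python
import bisect

# Hypothetical bin edges in units of normalized (z-scored) duration.
BIN_EDGES = [-1.0, 0.0, 1.0, 2.0]


def bin_duration(z, edges=BIN_EDGES):
    """Map a normalized duration to a coarse bin index (0 .. len(edges))."""
    return bisect.bisect_right(edges, z)
```

Two slightly different normalized values, such as 0.4999 and 0.5001, fall into the same bin, so a decision tree can no longer split on the exact value associated with one word.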

2.1.2.3 F0 features. Pitch information is typically 
less robust and more difficult to model than other 
prosodic features, such as duration. This is largely 
attributable to variability in the way pitch is used across 
speakers and speaking contexts, complexity in representing 
pitch patterns, segmental effects, and pitch 
tracking discontinuities (such as doubling errors and 
pitch halving, the latter of which is also associated 
with nonmodal voicing). 

To smooth out microintonation and tracking errors, 
simplify our F0 feature computation, and identify 
speaking-range parameters for each speaker, we 
postprocessed the frame-level F0 output from a standard 
pitch tracker. We used an autocorrelation-based pitch 
tracker (the "get_f0" function in ESPS/Waves (ESPS, 
1993), with default parameter settings) to generate 
estimates of frame-level F0 (Talkin, 1995). Postprocessing 
steps are outlined in Fig. 2 and are described 
further in work on prosodic modeling for speaker verification 
(Sönmez et al., 1998). 

The raw pitch tracker output has two main noise 
sources, which are minimized in the filtering stage. 
F0 halving and doubling are estimated by a lognormal 
tied mixture model (LTM) of F0, based on histograms 
of F0 values collected from all data from the same 
speaker.[4] For the Broadcast News corpus we pooled 
data from the same speaker over multiple news shows; 
for the Switchboard data, we used only the data from 
one side of a conversation for each histogram. 

For each speaker, the F0 distribution was modeled 
by three lognormal modes spaced log 2 apart 
in the log frequency domain. The locations of 
the modes were modeled with one tied parameter 
(μ − log 2, μ, μ + log 2), variances were scaled to be 
the same in the log domain, and mixture weights were 
estimated by an expectation maximization (EM) algorithm. 
This approach allowed estimation of speaker 
F0 range parameters that proved useful for F0 normalization. 
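
The following Python sketch shows one way such a tied mixture could be fit. It is an illustration under stated assumptions, not the model used in the study: the text says only that the mixture weights were estimated by EM, whereas this sketch also re-estimates the tied location and the shared log-domain variance, and it initializes the location at the median log F0. All function names are hypothetical.

```python
import math

LOG2 = math.log(2.0)
OFFSETS = (-LOG2, 0.0, LOG2)  # halved, modal, doubled modes in log Hz


def posteriors(y, mu, var, weights):
    """Posterior probability of each mode for one log-F0 value y."""
    p = [w * math.exp(-((y - mu - o) ** 2) / (2.0 * var))
         for w, o in zip(weights, OFFSETS)]
    total = sum(p)
    if total == 0.0:  # value far from every mode: fall back to the prior
        return list(weights)
    return [pj / total for pj in p]


def fit_ltm(f0_hz, n_iter=50, init_sigma=0.2):
    """Fit the tied three-mode mixture by EM; returns (mu, sigma, weights)."""
    y = sorted(math.log(f) for f in f0_hz)
    n = len(y)
    mu = y[n // 2]                # assumed start: median of log F0
    var = init_sigma ** 2         # shared variance for all three modes
    weights = [1.0 / 3.0] * 3
    for _ in range(n_iter):
        # E-step: mode responsibilities for every frame.
        resp = [posteriors(yi, mu, var, weights) for yi in y]
        # M-step: mixture weights, tied location, tied variance.
        weights = [sum(r[j] for r in resp) / n for j in range(3)]
        mu = sum(r[j] * (yi - OFFSETS[j])
                 for yi, r in zip(y, resp) for j in range(3)) / n
        var = sum(r[j] * (yi - mu - OFFSETS[j]) ** 2
                  for yi, r in zip(y, resp) for j in range(3)) / n
        var = max(var, 1e-6)      # guard against a collapsing variance
    return mu, math.sqrt(var), weights
```

Because the variance is shared, the Gaussian normalizing constant cancels in the posteriors. Frames whose highest posterior is not the modal one can then be flagged as halved or doubled.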

Prior to the regularization stage, median filtering 
smooths voicing onsets during which the tracker is 
unstable, resulting in local undershoot or overshoot. 
We applied median filtering to windows of voiced 
frames with a neighborhood size of 7 plus or minus 3 
frames. Next, in the regularization stage, F0 contours 
are fit by a simple piecewise linear model 

$$\tilde{F}_0(x) = \sum_{k=1}^{K} (a_k x + b_k)\, I[x_{k-1} < x \le x_k] \qquad (2)$$

where K is the number of nodes, x_k are the node locations, 
and a_k and b_k are the linear parameters for a 
given region. The parameters are estimated by minimizing 
the mean squared error with a greedy node 
placement algorithm. The smoothness of the fits is 
fixed by two global parameters: the maximum mean 
squared error for deviation from a line in a given region, 
and the minimum length of a region. 
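
To make the two stages concrete, here is a small Python sketch of median filtering followed by a greedy piecewise linear fit. The greedy strategy shown (grow each region until the line fit exceeds the error threshold) is only one plausible reading of "greedy node placement"; the function names, default thresholds, and the fixed window of 7 frames are assumptions.

```python
import statistics


def median_filter(values, size=7):
    """Median-filter a sequence, shrinking the window at the edges."""
    half = size // 2
    return [statistics.median(values[max(0, i - half):i + half + 1])
            for i in range(len(values))]


def fit_line(xs, ys):
    """Least-squares line through (xs, ys); returns (slope, intercept, mse)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    b = my - a * mx
    mse = sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
    return a, b, mse


def stylize(ys, max_mse=4.0, min_len=5):
    """Greedy piecewise linear fit over frame indices.

    Returns a list of (start, end, slope, intercept) with end exclusive.
    A trailing region may be shorter than min_len in this simple sketch.
    """
    segments, start, n = [], 0, len(ys)
    while start < n:
        end = min(n, start + min_len)
        a, b, _ = fit_line(list(range(start, end)), ys[start:end])
        while end < n:
            a2, b2, mse = fit_line(list(range(start, end + 1)),
                                   ys[start:end + 1])
            if mse > max_mse:
                break  # place a node here and start a new region
            a, b, end = a2, b2, end + 1
        segments.append((start, end, a, b))
        start = end
    return segments
```

The two arguments of `stylize` correspond to the two global smoothness parameters named above: the maximum mean squared error within a region and the minimum region length.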

The resulting filtered and stylized F0 contour, an 
example of which is shown in Fig. 3, enables robust 
extraction of features such as the value of the F0 slope 
at a particular point, the maximum or minimum stylized 
F0 within a region, and a simple characterization 
of whether the F0 trajectory before a word boundary 
is broken or continued into the next word. In addition, 
over all data from a particular speaker, statistics 
such as average slopes can be computed for normalization 
purposes. These statistics, combined with the 
speaker range values computed from the speaker histograms, 
allowed us to easily and robustly compute 
a large number of F0 features, as outlined in Section 2.1.2. 
In exploratory work on Switchboard, we 
found that the stylized F0 features yielded better results 
than more complex features computed from the 
raw F0 tracks. Thus, we restricted our input features 
to those computed from the processed F0 tracks, and 
did the same for Broadcast News. 

[4] We settled on a cheating approach here, assuming speaker 
tracking information was available in testing, since automatic 
speaker segmentation and tracking was beyond the scope of this 
work. 

[Fig. 2: F0 processing] 

[Fig. 3: F0 contour filtering and regularization] 

[Fig. 4: Schematic example of stylized F0 for voiced 
regions of the text "hit last night at eleven". The speaker's 
estimated baseline F0 (from the lognormal tied mixture 
modeling) is also indicated.] 

We computed four different types of F0 features, 
all based on values computed from the stylized processing, 
but each capturing a different aspect of intonational 
behavior: (1) F0 reset features, (2) F0 range 
features, (3) F0 slope features, and (4) F0 continuity 
features. The general characteristics captured can be 
illustrated with the help of Fig. 4. 

Reset features. The first set of features was designed 
to capture the well-known tendency of speakers 
to reset pitch at the start of a new major unit, such 
as a topic or sentence boundary, relative to where they 
left off. Typically the reset is preceded by a final fall in 
pitch associated with the ends of such units. Thus, at 
boundaries we expect a larger reset than at nonboundaries. 
We took measurements from the stylized F0 
contours for the voiced regions of the word preceding 
and of the word following the boundary. Measurements 
were taken at either the minimum, maximum, 
mean, starting, or ending stylized F0 value within the 
region associated with each of the words. Numerous 
features were computed to compare the previous 
to the following word; we computed both the log of 
the ratio between the two values, and the log of the 
difference between them, since it is unclear which 
measure would be better. Thus, in Fig. 4, the F0 difference 
between "at" and "eleven" would not imply a 
reset, but that between "night" and "at" would imply a 
large reset, particularly for the measure comparing the 
minimum F0 of "night" to the maximum F0 of "at". 
Parallel features were also computed based on the 200 
ms windows rather than the words. 
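
A minimal Python sketch of the log-ratio variant of these reset features is given below. The feature naming, the choice to put the following word in the numerator (so that an upward reset is positive), and the use of `None` for undefined cases are assumptions; the log-difference variant is omitted because the text does not say how nonpositive differences were handled.

```python
import math

# Summary statistics taken over the stylized F0 values of a word's voiced region.
STATS = {
    "min": min,
    "max": max,
    "mean": lambda v: sum(v) / len(v),
    "first": lambda v: v[0],
    "last": lambda v: v[-1],
}


def reset_features(prev_f0, next_f0):
    """Log-ratio reset features across a boundary.

    prev_f0 / next_f0: stylized F0 values (Hz) for the voiced frames of the
    word before / after the boundary. Returns None if either side has no
    voiced frames, in which case the features are undefined.
    """
    if not prev_f0 or not next_f0:
        return None
    feats = {}
    for prev_name, prev_stat in STATS.items():
        for next_name, next_stat in STATS.items():
            key = f"log_ratio_{prev_name}_{next_name}"
            feats[key] = math.log(next_stat(next_f0) / prev_stat(prev_f0))
    return feats
```

In the "night" / "at" example, the entry `log_ratio_min_max` compares the minimum F0 of "night" with the maximum F0 of "at" and would be large and positive at a reset.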

Range features. The second set of features reflected 
the pitch range of a single word (or window), 
relative to one of the speaker-specific global F0 range 
parameters computed from the lognormal tied mixture 
modeling described earlier. We looked both before 
and after the boundary, but found features of the preboundary 
word or window to be the most useful for 
these tasks. For the speaker-specific range parameters, 
we estimated F0 baselines, toplines, and some 
intermediate range measures. By far the most useful 
value in our modeling was the F0 baseline, which 
we computed as occurring halfway between the first 
mode and the second mode in each speaker-specific 
F0 histogram, i.e., roughly at the bottom of the modal 
(nonhalved) speaking range. The F0 toplines and 
intermediate values in the range proved much less 
useful than the baselines across tasks. 

Unlike the reset features, which had to be defined 
as "missing" at boundaries containing a speaker 
change, the range features are defined at all boundaries 
for which F0 estimates can be made (since they look 
only at one side of the boundary). Thus, for example, 
in Fig. 4 the F0 of the word "night" falls very close 
to the speaker's F0 baseline, and can be utilized irrespective 
of whether or not the speaker changes before 
the next word. 

We were particularly interested in these features 
for the case of topic segmentation in Broadcast News, 
since due to the frequent speaker changes at actual 
topic boundaries we needed a measure that would be 
defined at such locations. We also expected speakers 
to be more likely to fall closer to the bottom of their 
pitch range for topic than for sentence boundaries, 
since the former implies a greater degree of finality. 

Slope features. Our final two sets of F0 features
looked at the slopes of the stylized F0 segments, both
for a word (or window) on only one side of the boundary,
and for continuity across the boundary. The aim
was to capture local pitch variation such as the presence
of pitch accents and boundary tones. Slope features
measured the degree of F0 excursion before or
after the boundary (relative to the particular speaker's
average excursion in the pitch range), or simply
normalized by the pitch range on the particular word.

Continuity features. Continuity features measured 
the change in slope across the boundary. Here, we 
expected that continuous trajectories would correlate 
with nonboundaries, and broken trajectories would 
tend to indicate boundaries, regardless of difference 
in pitch values across words. For example, in the figure,
the words "last" and "night" show a continuous pitch 
trajectory, so that it is highly unlikely there is a major 
syntactic or semantic boundary at that location. We 
computed both scalar (slope difference) and categori- 
cal (rise-fall) features for inclusion in the experiments. 

2.1.2.4 Estimated voice quality features. Scalar F0
statistics (e.g., those contributing to slopes, or
minimum/maximum F0 within a word or region) were
computed ignoring any frames associated with F0
halving or doubling (frames whose highest posterior
was not that for the modal region). However, regions
corresponding to F0 halving as estimated by the
lognormal tied mixture model showed high correlation
with regions of creaky voice or glottalization that
had been independently hand-labeled by a phonetician.
Since creak may correlate with our boundaries
of interest, we also included some categorical features,
reflecting the presence or absence of creak.

We used two simple categorical features. One 
feature reflected whether or not pitch halving (as 
estimated by the model) was present for at least a 
few frames, anywhere within the word preceding the 
boundary. The second version looked at whether halv- 
ing was present at the end of that word. As it turned 
out, while these two features showed up in decision 
trees for some speakers, and in the patterns we ex- 
pected, glottalization and creak are highly speaker 
dependent and thus were not helpful in our overall 
modeling. However, for speaker-dependent model- 
ing, such features could potentially be more useful. 

2.1.2.5 Other features. We included two types of
nonprosodic features, turn-related features and gender
features. Both kinds of features were legitimately
available for our modeling, in the sense that standard
speech recognition evaluations made this information
known. Whether or not speaker change markers would
actually be available depends on the application. It is
not unreasonable, however, to assume this information,
since automatic algorithms have been developed for
this purpose (e.g., Przybocki and Martin, 1999; Liu
and Kubala, 1999; Sonmez et al., 1999). Such
nonprosodic features often interact with prosodic features.
For example, turn boundaries cause certain prosodic
features (such as F0 difference across the boundary) to
be undefined, and speaker gender is highly correlated
with F0. Thus, by including the features we could
better understand feature interactions and check for
appropriateness of normalization schemes.

Our turn-related features included whether or not 
the speaker changed at a boundary, the time elapsed 
from the start of the turn, and the turn count in the con- 
versation. The last measure was included to capture 
structure information about the data, such as the pre- 
ponderance of topic changes occurring early in Broad- 
cast News shows, due to short initial summaries of 
topics at the beginning of certain shows. 

We included speaker gender mainly as a check to
make sure the F0 processing was normalized properly
for gender differences. That is, we initially hoped that
this feature would not show up in the trees. However,
we learned that there are reasons other than poor
normalization for gender to occur in the trees, including
potential truly stylistic differences between men and
women, and structural differences associated with gender
(such as differences in lengths of stories in Broadcast
News). Thus, gender revealed some interesting
inherent interactions in our data, which are discussed
further in Section 3.3. In addition to speaker gender,
we included the gender of the listener, to investigate
the degree to which features distinguishing boundaries
might be affected by sociolinguistic variables.

2.1.3 Decision trees 

As in past prosodic modeling work (Shriberg et al., 
1997), we chose to use CART-style decision trees 
(Breiman et al., 1984), as implemented by the IND 
package (Buntine and Caruana, 1992). The software 
offers options for handling missing feature values
(important since we did not have good pitch estimates for
all data points), and is capable of processing large
amounts of training data. Decision trees are
probabilistic classifiers that can be characterized briefly
as follows. Given a set of discrete or continuous
features and a labeled training set, the decision tree
construction algorithm repeatedly selects a single feature
that, according to an information-theoretic criterion
(entropy), has the highest predictive value for the
classification task in question. The feature queries
are arranged in a hierarchical fashion, yielding a tree 
of questions to be asked of a given data point. The 
leaves of the tree store probabilities about the class 
distribution of all samples falling into the correspond- 
ing region of the feature space, which then serve as 
predictors for unseen test samples. Various smooth- 
ing and pruning techniques are commonly employed 
to avoid overfitting the model to the training data. 

Although any of several probabilistic classifiers 
(such as neural networks, exponential models, or naive 
Bayes networks) could be used as posterior probabil- 
ity estimators, decision trees allow us to add, and 
automatically select, other (nonprosodic) features that 
might be relevant to the task — including categorical 
features. Furthermore, decision trees make no as- 
sumptions about the shape of feature distributions; 
thus it is not necessary to convert feature values to 
some standard scale. And perhaps most importantly, 
decision trees offer the distinct advantage of inter- 
pretability. We have found that human inspection of 
feature interactions in a decision tree fosters an intu- 
itive understanding of feature behaviors and the phe- 
nomena they reflect. This understanding is crucial for 
progress in developing better features, as well as for 
debugging the feature extraction process itself. 

The decision tree served as a prosodic model for
estimating the posterior probability of a (sentence or
topic) boundary at a given inter-word boundary, based
on the automatically extracted prosodic features. We
define $F_i$ as the features extracted from a window
around the $i$th potential boundary, and $T_i$ as the
boundary type (boundary/no-boundary) at that position. For
each task, decision trees were trained to predict the
$i$th boundary type, i.e., to estimate $P(T_i \mid F_i, W)$. By
design, this decision was only weakly conditioned on
the word sequence $W$, insofar as some of the prosodic
features depend on the phonetic alignment of the word
models. We preferred the weak conditioning for
robustness to word errors in speech recognizer output.
Missing feature values in $F_i$ occurred mainly for the
F0 features (due to lack of robust pitch estimates for
an example), but also at locations where features were
inherently undefined (e.g., pauses at turn boundaries).
Such cases were handled in testing by sending the
test sample down each tree branch with the proportion
found in the training set at that node, and then
averaging the corresponding predictions.
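
The treatment of missing values described above can be illustrated with a minimal sketch. It is not the IND implementation; the tree layout, the feature names (`pause_dur`, `f0_diff`), and the stored branch fraction `frac_left` are hypothetical names chosen for illustration. A sample whose queried feature is missing is sent down both branches, and the two predictions are averaged with weights equal to the training-set proportions at that node.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass
class Leaf:
    p_boundary: float  # posterior probability of a boundary stored at this leaf


@dataclass
class Node:
    feature: str                  # feature queried at this node (hypothetical name)
    threshold: float              # samples with value < threshold go left
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]
    frac_left: float              # fraction of training samples that went left


def predict(tree: Union[Node, Leaf], sample: Dict[str, Optional[float]]) -> float:
    """Return P(boundary | features), averaging over branches when a value is missing."""
    if isinstance(tree, Leaf):
        return tree.p_boundary
    value = sample.get(tree.feature)
    if value is None:
        # Missing value: weight both subtrees by the training proportions.
        return (tree.frac_left * predict(tree.left, sample)
                + (1.0 - tree.frac_left) * predict(tree.right, sample))
    branch = tree.left if value < tree.threshold else tree.right
    return predict(branch, sample)


# Toy tree: long pauses suggest a boundary; short pauses defer to an F0 feature.
toy_tree = Node(
    feature="pause_dur", threshold=0.5,
    left=Node(feature="f0_diff", threshold=-0.1,
              left=Leaf(0.40), right=Leaf(0.05), frac_left=0.25),
    right=Leaf(0.90),
    frac_left=0.8,
)
```

With this toy tree, a sample with a short pause and no F0 estimate receives a blend of the two F0 leaves rather than being discarded, which is what makes the range and reset features usable even when pitch tracking fails.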

2.1.4 Feature selection algorithm

Our initial feature sets contained a high degree of fea- 
ture redundancy because, for example, similar features 
arose from changing only normalization schemes, and 
others (such as energy and F0) are inherently correlated
in speech production. The greedy nature of the
decision tree learning algorithm implies that larger 
initial feature sets can yield suboptimal results. The 
availability of more features provides greater opportu- 
nity for "greedy" features to be chosen; such features 
minimize entropy locally but are suboptimal with re- 
spect to entropy minimization over the whole tree. 
Furthermore, it is desirable to remove redundant fea- 
tures for computational efficiency and to simplify in- 
terpretation of results. 

To automatically reduce our large initial candi- 
date feature set to an optimal subset, we developed 
an iterative feature selection algorithm that involved 
running multiple decision trees in training (sometimes 
hundreds for each task). The algorithm combines el- 
ements of brute-force search with previously deter- 
mined human-based heuristics for narrowing the fea- 
ture space to good groupings of features. We used 
the entropy reduction of the overall tree after cross- 
validation as a criterion for selecting the best sub- 
tree. Entropy reduction is the difference in test-set 
entropy between the prior class distribution and the 
posterior distribution estimated by the tree. It is a 
more fine-grained metric than classification accuracy, 
and is thus the more appropriate measure to use for
any of the model combination approaches described
in Section 2.3.
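
To make the entropy reduction criterion concrete, the following sketch computes it for a binary boundary task. It reflects one reading of the definition above, namely the average negative log probability of the true class under the prior minus the same quantity under the tree posteriors; the function name and the toy numbers are assumptions for illustration, and posteriors are assumed to lie strictly between 0 and 1.

```python
import math
from typing import Sequence


def entropy_reduction(labels: Sequence[int],
                      posteriors: Sequence[float],
                      prior: float) -> float:
    """Test-set entropy under the prior minus entropy under the model, in bits.

    labels:     1 for boundary, 0 for no boundary
    posteriors: model estimates of P(boundary | features), strictly in (0, 1)
    prior:      P(boundary) from the class distribution, strictly in (0, 1)
    """
    def avg_neg_log2(probs: Sequence[float]) -> float:
        total = 0.0
        for y, p in zip(labels, probs):
            p_true = p if y == 1 else 1.0 - p  # probability given to the true class
            total -= math.log2(p_true)
        return total / len(labels)

    h_prior = avg_neg_log2([prior] * len(labels))
    h_model = avg_neg_log2(posteriors)
    return h_prior - h_model


# Two hypothetical models that make identical hard decisions at threshold 0.5
# but differ in how sharp their posteriors are.
labels = [1, 0, 0, 0]
cautious = [0.6, 0.4, 0.4, 0.4]
sharp = [0.9, 0.1, 0.1, 0.1]
```

Both toy models classify every sample correctly, so accuracy cannot separate them, whereas entropy reduction rewards the sharper posteriors. This is the sense in which the measure is more fine-grained, and why it matters when the posteriors are later combined with a language model.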



Footnote: For multivalued or continuous features, the algorithm also
determines optimal feature value subsets or thresholds, respectively,
to compare the feature to.



The algorithm proceeds in two phases. In the first 
phase, the large number of initial candidate features 
is reduced by a leave-one-out procedure. Features 
that do not reduce performance when removed are 
eliminated from further consideration. The second 
phase begins with the reduced number of features,
and performs a beam search over all possible subsets
of features. Because our initial feature set contained 
over 100 features, we split the set into smaller sub- 
sets based on our experience with feature behaviors. 
For each subset we included a set of "core" features, 
which we knew from human analyses of results served 
as catalysts for other features. For example, in all sub- 
sets, pause duration was included, since without this 
feature present, duration and pitch features are much 
less discriminative for the boundaries of interest.
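
The first (leave-one-out) phase can be sketched as follows. This is only an illustration under stated assumptions: the text does not say whether features are dropped one at a time or all at once, so the sketch removes them sequentially; `score` stands in for the cross-validated entropy reduction of a tree trained on the given subset, and the toy scoring function and feature names are invented for the example.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Optional


def leave_one_out_prune(features: Iterable[str],
                        score: Callable[[FrozenSet[str]], float]) -> List[str]:
    """Drop each feature whose removal does not lower the score (sequential variant)."""
    kept = list(features)
    for feat in list(kept):
        if len(kept) == 1:
            break
        trial = [f for f in kept if f != feat]
        if score(frozenset(trial)) >= score(frozenset(kept)):
            kept = trial  # removal did not hurt, so eliminate the feature
    return kept


# Toy stand-in for "train a tree and measure entropy reduction": each feature
# belongs to an information group, and a subset scores one point per group covered.
GROUPS: Dict[str, Optional[str]] = {
    "pause_raw": "pause",
    "pause_norm": "pause",   # redundant with pause_raw
    "f0_diff": "f0",
    "noise": None,           # carries no information
}


def toy_score(subset: FrozenSet[str]) -> float:
    return float(len({GROUPS[f] for f in subset if GROUPS[f] is not None}))


selected = leave_one_out_prune(GROUPS.keys(), toy_score)
```

In the toy run one of the two redundant pause features and the uninformative feature are removed, while one representative of each useful group survives. Removing features sequentially is a design choice made here so that two correlated useful features do not eliminate each other.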



Footnote: The success of this approach depends on the makeup of the
initial feature sets, since highly correlated useful features can cancel
each other out during the first phase. This problem can be addressed
by forming initial feature subsets that minimize within-set
cross-feature correlations.



2.2 Language modeling

The goal of language modeling for our segmentation
tasks is to capture information about segment
boundaries contained in the word sequences. We
denote boundary classifications by $T = T_1, \ldots, T_k$ and
use $W = W_1, \ldots, W_m$ for the word sequence. Our
general approach is to model the joint distribution of
boundary types and words in a hidden Markov model
(HMM), the hidden variable in this case being the
boundaries $T_i$ (or some related variable from which
$T_i$ can be inferred). Because we had hand-labeled
training data available for all tasks, the HMM
parameters could be trained in supervised fashion.

The structure of the HMM is task specific, as
described below, but in all cases the Markovian
character of the model allows us to efficiently perform
the probabilistic inferences desired. For example, for
topic segmentation we extract the most likely overall
boundary classification

$$\operatorname*{argmax}_{T} P(T \mid W) \qquad (2)$$

using the Viterbi algorithm (Viterbi, 1967). This
optimization criterion is appropriate because the topic
segmentation evaluation metric prescribed by the TDT
program (Doddington, 1998) rewards overall
consistency of the segmentation.

Footnote: For example, given three sentences $s_1 s_2 s_3$ and strong evidence
that there is a topic boundary between $s_1$ and $s_3$, it is better to output
a boundary either before or after $s_2$, but not in both places.

For sentence segmentation, the evaluation metric
simply counts the number of correctly labeled
boundaries (see Section 2.4.4). Therefore, it is advantageous
to use the slightly more complex forward-backward
algorithm (Baum et al., 1970) to maximize the posterior
probability of each individual boundary classification

$$\operatorname*{argmax}_{T_i} P(T_i \mid W) . \qquad (3)$$

This approach minimizes the expected per-boundary
classification error rate (Dermatas and Kokkinakis,
1995).



2.2.1 Sentence segmentation

We relied on a hidden-event N-gram language model 
(LM) (Stolcke and Shriberg, 1996; Stolcke et al., 
1998). The states of the HMM consist of the end- 
of-sentence status of each word (boundary or no- 
boundary), plus any preceding words and possibly 
boundary tags to fill up the N-gram context (N = 4 in
our experiments). Transition probabilities are given 
by N-gram probabilities estimated from annotated, 
boundary-tagged training data using Katz backoff 
(Katz, 1987). For example, the bigram parameter 
P(<S>|tonight) gives the probability of a sentence 
boundary following the word "tonight". HMM obser- 
vations consist of only the current word portion of the 
underlying N-gram state (with emission likelihood 1), 
constraining the state sequence to be consistent with 
the observed word sequence. 

2.2.2 Topic segmentation 

We first constructed 100 individual unigram topic cluster
language models, using the multipass k-means algorithm
described in (Yamron et al., 1998). We used
the pooled Topic Detection and Tracking (TDT) Pilot 
and TDT-2 training data (Cieri et al., 1999). We re- 
moved stories with fewer than 300 and more than 3000 
words, leaving 19,916 stories with an average length 
of 538 words. Then, similar to the Dragon topic seg- 
mentation approach (Yamron et al., 1998), we built 
an HMM in which the states are topic clusters, and 
the observations are sentences. The resulting HMM 
forms a complete graph, allowing transition between 
any two topic clusters. In addition to the basic HMM 
segmenter, we incorporated two states for modeling 
the initial and final sentences of a topic segment. We 
reasoned that this can capture formulaic speech patterns
used by broadcast speakers. Likelihoods for the
start and end models are obtained as the unigram language
model probabilities of the topic-initial and final
sentences, respectively, in the training data. Note that
single start and end states are shared for all topics, and
traversal of the initial and final states is optional in
the HMM topology. The topic cluster models work
best if whole blocks of words or "pseudo-sentences"
are evaluated against the topic language models (the
likelihoods are otherwise too noisy). We therefore
presegment the data stream at pauses exceeding 0.65
second, a process we will refer to as "chopping".
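
The chopping step can be sketched in a few lines. The 0.65-second threshold comes from the text; the input format (a list of word tokens with start and end times in seconds) and the function name are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Word = Tuple[str, float, float]  # (token, start time, end time), times in seconds


def chop(words: List[Word], min_pause: float = 0.65) -> List[List[str]]:
    """Split a time-aligned word stream into pseudo-sentences at long pauses."""
    chunks: List[List[str]] = []
    current: List[str] = []
    prev_end: Optional[float] = None
    for token, start, end in words:
        # Start a new chunk when the silence before this word exceeds the threshold.
        if prev_end is not None and current and start - prev_end > min_pause:
            chunks.append(current)
            current = []
        current.append(token)
        prev_end = end
    if current:
        chunks.append(current)
    return chunks


# Toy alignment: only the 0.8 s gap after "evening" exceeds the threshold.
toy_words = [
    ("good", 0.00, 0.30), ("evening", 0.35, 0.80),
    ("in", 1.60, 1.70), ("washington", 1.75, 2.40), ("today", 2.90, 3.30),
]
pseudo_sentences = chop(toy_words)
```

Each resulting block of words would then be scored as a unit against the topic cluster language models, which is what keeps the likelihoods from becoming too noisy.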



2.3 Model combination

We expect prosodic and lexical segmentation cues
to be partly complementary, so that combining both
knowledge sources should give superior accuracy over
using each source alone. This raises the issue of how
the knowledge sources should be integrated. Here, we
describe two approaches to model combination that
allow the component prosodic and lexical models to
be retained without much modification. While this is
convenient and computationally efficient, it prevents
us from explicitly modeling interactions (i.e.,
statistical dependence) between the two knowledge sources.
Other researchers have proposed model architectures
based on decision trees (Heeman and Allen, 1997)
or exponential models (Beeferman et al., 1999) that
can potentially integrate the prosodic and lexical cues
discussed here. In other work (Stolcke et al., 1998;
Tür et al., 2000) we have started to study integrated
approaches for the segmentation tasks studied here,
although preliminary results show that the simple
combination techniques are very competitive in practice.



2.3.1 Posterior probability interpolation

Both the prosodic decision tree and the language
model (via the forward-backward algorithm) estimate
posterior probabilities for each boundary type $T_i$. We
can arrive at a better posterior estimator by linear
interpolation:

$$P(T_i \mid W, F) \approx \lambda P_{LM}(T_i \mid W) + (1 - \lambda) P_{DT}(T_i \mid F_i, W) \qquad (4)$$

where $\lambda$ is a parameter tuned on held-out data to
optimize the overall model performance.



2.3.2 Integrated hidden Markov modeling

Our second model combination approach is based on
the idea that the HMM used for lexical modeling can
be extended to "emit" both words and prosodic
observations. The goal is to obtain an HMM that models the
joint distribution $P(W, F, T)$ of word sequences $W$,
prosodic features $F$, and hidden boundary types $T$ in a
Markov model. With suitable independence
assumptions we can then apply the familiar HMM techniques
to compute

$$\operatorname*{argmax}_{T} P(T \mid W, F)$$

or

$$\operatorname*{argmax}_{T_i} P(T_i \mid W, F)$$

which are now conditioned on both lexical and
prosodic cues. We describe this approach for
sentence segmentation HMMs; the treatment for topic
segmentation HMMs is mostly analogous but
somewhat more involved, and described in detail elsewhere
(Tür et al., 2000).

To incorporate the prosodic information into the
HMM, we model prosodic features as emissions from
relevant HMM states, with likelihoods $P(F_i \mid T_i, W)$,
where $F_i$ is the feature vector pertaining to potential
boundary $T_i$. For example, an HMM state representing
a sentence boundary <S> at the current position
would be penalized with the likelihood $P(F_i \mid \text{<S>})$.
We do so based on the assumption that prosodic
observations are conditionally independent of each other
given the boundary types $T_i$ and the words $W$. Under
these assumptions, a complete path through the HMM
is associated with the total probability

$$P(W, T) \prod_i P(F_i \mid T_i, W) = P(W, F, T) , \qquad (5)$$

as desired.

The remaining problem is to estimate the
likelihoods $P(F_i \mid T_i, W)$. Note that the decision tree
estimates posteriors $P_{DT}(T_i \mid F_i, W)$. These can be
converted to likelihoods using Bayes' rule as in

$$P(F_i \mid T_i, W) = \frac{P(F_i \mid W)\, P_{DT}(T_i \mid F_i, W)}{P(T_i \mid W)} . \qquad (6)$$

The term $P(F_i \mid W)$ is a constant for all choices of $T_i$
and can thus be ignored when choosing the most
probable one. Next, because our prosodic model is
purposely not conditioned on word identities, but only on
aspects of $W$ that relate to time alignment, we
approximate $P(T_i \mid W) \approx P(T_i)$. Instead of explicitly
dividing the posteriors, we prefer to downsample the
training set to make $P(T_i = \text{yes}) = P(T_i = \text{no}) = \frac{1}{2}$. A
beneficial side effect of this approach is that the
decision tree models the lower-frequency events (segment
boundaries) in greater detail than if presented with the
raw, highly skewed class distribution.

When combining probabilistic models of different
types, it is advantageous to weight the contributions
of the language models and the prosodic trees
relative to each other. We do so by introducing a
tunable model combination weight (MCW), and by using
$P_{DT}(F_i \mid T_i, W)^{\mathrm{MCW}}$ as the effective prosodic
likelihoods. The value of MCW is optimized on held-out
data.
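
The two combination schemes reduce to a small amount of arithmetic per boundary, sketched below. This is an illustration under the stated approximations, not the original implementation: the function names are hypothetical, the constant term $P(F_i \mid W)$ is dropped as in the text, and the prior defaults to 1/2 to mirror a tree trained on downsampled data.

```python
from typing import Tuple


def interpolate(p_lm: float, p_dt: float, lam: float) -> float:
    """Eq. (4): linear interpolation of LM and decision tree posteriors."""
    return lam * p_lm + (1.0 - lam) * p_dt


def scaled_likelihoods(p_dt_boundary: float,
                       prior_boundary: float = 0.5,
                       mcw: float = 1.0) -> Tuple[float, float]:
    """Eq. (6) up to the constant P(F_i | W), raised to the model combination weight.

    Returns scores proportional to P(F_i | boundary) and P(F_i | no boundary).
    With prior_boundary = 0.5 (downsampled training data) the division only
    rescales both scores by the same factor.
    """
    like_yes = (p_dt_boundary / prior_boundary) ** mcw
    like_no = ((1.0 - p_dt_boundary) / (1.0 - prior_boundary)) ** mcw
    return like_yes, like_no


# Toy values: the tree is fairly confident of a boundary, the LM is not.
combined_posterior = interpolate(p_lm=0.2, p_dt=0.8, lam=0.5)
yes_score, no_score = scaled_likelihoods(0.8)
damped_yes, damped_no = scaled_likelihoods(0.8, mcw=0.5)
```

In the integrated HMM these scores would multiply the N-gram transition probabilities along each path. An MCW below 1 flattens the prosodic evidence (the toy likelihood ratio drops from 4 to 2), which is how the weight trades off the prosodic tree against the language model.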



2.3.3 HMM posteriors as decision tree features 

A third approach could be used to combine the language
and prosodic models, although for practical reasons
we chose not to use it in this work. In this
approach, an HMM incorporating only lexical information
is used to compute posterior probabilities of
boundary types, as described in Section 2.3.1. A
prosodic decision tree is then trained, using the HMM
posteriors as additional input features. The tree is free
to combine the word-based posteriors with prosodic
features; it can thus model limited forms of dependence
between prosodic and word-based information
(as summarized in the posteriors).

A severe drawback of using posteriors in the deci- 
sion tree, however, is that in our current paradigm, the 
HMM is trained on correct words. In testing, the tree 
may therefore grossly overestimate the informative- 
ness of the word-based posteriors based on automatic 
transcriptions. Indeed, we found that on a hidden- 
event detection task similar to sentence segmentation 
(Stolcke et al., 1998) this model combination method 
worked well on true words, but fared worse than the
other approaches on recognized words. To remedy 
the mismatch between training and testing of the com- 
bined model, we would have to train, as well as test, 
on recognized words; this would require computation- 
ally intensive processing of a large corpus. For these 
reasons, we decided not to use HMM posteriors as tree 
features in the present studies. 



2.3.4 Alternative models 

A few additional comments are in order regarding 
our choice of model architectures and possible alter- 
natives. The HMMs used for lexical modeling are 
likelihood models, i.e., they model the probabilities 
of observations given the hidden variables (boundary 
types) to be inferred, while making assumptions about 
the independence of the observations given the hidden 
events. The main virtue of HMMs in our context is that 
they integrate the local evidence (words and prosodic 
features) with models of context (the N-gram history) 
in a very computationally efficient way (for both train- 
ing and testing). A drawback is that the independence 
assumptions may be inappropriate and may therefore 
inherently limit the performance of the model.

The decision trees used for prosodic modeling, 
on the other hand, are posterior models, i.e., they 
directly model the probabilities of the unknown vari- 
ables given the observations. Unlike likelihood-based 
models, this has the advantages that model training 
explicitly enhances discrimination between the target 
classifications (i.e., boundary types), and that input 
features can be combined easily to model interac- 
tions between them. Drawbacks are the sensitivity 
to skewed class distributions (as pointed out in the 
previous section), and the fact that it becomes com- 
putationally expensive to model interactions between 
multiple target variables (e.g., adjacent boundaries). 
Furthermore, input features with large discrete ranges 
(such as the set of words) present practical problems 
for many posterior model architectures. 

Even for the tasks discussed here, other modeling 
choices would have been practical, and await com- 
parative study in future work. For example, posterior 
lexical models (such as decision trees or neural net- 
work classifiers) could be used to predict the boundary 
types from words and prosodic features together, us- 
ing word-coding techniques developed for tree-based 
language models (Bahl et al., 1989). Conversely, we 
could have used prosodic likelihood models, remov- 
ing the need to convert posteriors to likelihoods. For 
example, the continuous feature distributions could be 
modeled with (mixtures of) multidimensional Gaus- 
sians (or other types of distributions), as is commonly 
done for the spectral features in speech recognizers 
(Digalakis and Murveit, 1994, among others).

2.4 Data

2.4.1 Speech data and annotations

Switchboard data used in sentence segmentation was
drawn from a subset of the corpus (Godfrey et al.,
1992) that had been hand-labeled for sentence
boundaries (Meteer et al., 1995) by the Linguistic Data
Consortium (LDC). Broadcast News data for topic
and sentence segmentation was extracted from the
LDC's 1997 Broadcast News (BN) release. Sentence
boundaries in BN were automatically determined
using the MITRE sentence tagger (Palmer and Hearst,
1997) based on capitalization and punctuation in the
transcripts. Topic boundaries were derived from the
SGML markup of story units in the transcripts.
Training of Broadcast News language models for sentence
segmentation also used an additional 130 million
words of text-only transcripts from the 1996 Hub-4
language model corpus, in which sentence boundaries
had been marked by SGML tags.



2.4.2 Training, tuning, and test sets

Table 1 shows the amount of data used for the various
tasks. For each task, separate datasets were used for
model training, for tuning any free parameters (such
as the model combination and posterior interpolation
weights), and for final testing. In most cases the
language model and the prosodic model components used
different amounts of training data.

As is common for speech recognition evaluations
on Broadcast News, frequent speakers (such as news
anchors) appear in both training and test sets. By
contrast, in Switchboard our train and test sets did
not share any speakers. In both corpora, the average
word count per speaker decreased roughly
monotonically with the percentage of speakers included. In
particular, the Broadcast News data contained a large
number of speakers who contributed very few words.
A reasonably meaningful statistic to report for words
per speaker is thus a weighted average, or the
average number of datapoints by the same speaker. On
that measure, the two corpora had similar statistics:
6687.11 and 7525.67 for Broadcast News and
Switchboard, respectively.



2.4.3 Word recognition

Experiments involving recognized words used the
1-best output from SRI's DECIPHER large-vocabulary
speech recognizer. We simplified processing by
skipping several of the computationally expensive or
cumbersome steps often used for optimum performance,
such as acoustic adaptation and multiple-pass
decoding. The recognizer performed one bigram decoding
pass, followed by a single N-best rescoring pass using
a higher-order language model. The Switchboard test
set was decoded with a word error rate of 46.7% using
acoustic models developed for the 1997 Hub-5
evaluation (National Institute for Standards and Technology,
1997). The Broadcast News recognizer was based on
the 1997 SRI Hub-4 recognizer (Sankar et al., 1998)
and had a word error rate of 30.5% on the test set used
in our study.

2.4.4 Evaluation metrics 

Sentence segmentation performance for true words 
was measured by boundary classification error, i.e. the 
percentage of word boundaries labeled with the incor- 
rect class. For recognized words, we first performed 
a string alignment of the automatically labeled recog- 
nition hypothesis with the reference word string (and 
its segmentation). Based on this alignment we then 
counted the number of incorrectly labeled, deleted, 
and inserted word boundaries, expressed as a percent- 
age of the total number of word boundaries. This 
metric yields the same result as the boundary classi- 
fication error rate if the word hypothesis is correct. 
Otherwise, it includes additional errors from inserted 
or deleted boundaries, in a manner similar to standard 
word error scoring in speech recognition. Topic seg- 
mentation was evaluated using the metric defined by 
NIST for the TDT-2 evaluation (Doddington, 1998). 
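
For the true-word condition, the boundary classification error described above is a simple per-boundary count, sketched here. The label encoding (1 for a sentence boundary, 0 otherwise) and the toy data are assumptions for illustration; the recognized-word variant, which first needs a string alignment to count inserted and deleted boundaries, is not covered by this sketch.

```python
from typing import Sequence


def boundary_error_rate(reference: Sequence[int], hypothesis: Sequence[int]) -> float:
    """Percentage of inter-word boundaries labeled with the incorrect class."""
    if len(reference) != len(hypothesis):
        raise ValueError("true-word scoring needs one label per word boundary")
    if not reference:
        raise ValueError("no boundaries to score")
    errors = sum(1 for r, h in zip(reference, hypothesis) if r != h)
    return 100.0 * errors / len(reference)


# Toy reference with 2 sentence boundaries among 20 inter-word positions.
toy_reference = [0] * 9 + [1] + [0] * 9 + [1]
all_nonboundary = [0] * 20                   # "chance" system: always the majority class
one_miss = [0] * 9 + [1] + [0] * 10          # finds the first boundary, misses the second
```

The always-nonboundary hypothesis scores an error equal to the proportion of true boundaries in the toy data, which is how a chance baseline for this metric can be obtained.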

3 Results and discussion 

The following sections describe results from the 
prosodic modeling approach, for each of our three 
tasks. The first three sections focus on the tasks 
individually, detailing the features used in the best- 
performing tree. For sentence segmentation, we report 
on trees trained on non-downsampled data, as used in 
the posterior interpolation approach. For all tasks,
including topic segmentation, we also trained
downsampled trees for the HMM combination approach.
Where both types of trees were used (sentence
segmentation), feature usage on downsampled trees was
roughly similar to that of the non-downsampled trees,
so we describe only the non-downsampled trees. For
topic segmentation, the description refers to a
downsampled tree.

Table 1: Size of speech data sets used for model training and testing for the three segmentation tasks

Task                         Training (LM)      Training (Prosody)   Tuning          Test

SWB Sentence (transcribed)   1788 sides         1788 sides           209 sides       209 sides
                             (1.2M words)       (1.2M words)         (103K words)    (101K words)

SWB Sentence (recognized)    1788 sides         1788 sides           12 sides        38 sides
                             (1.2M words)       (1.2M words)         (6K words)      (18K words)

BN Sentence                  103 shows + BN96   93 shows             5 shows         5 shows
                             (130M words)       (700K words)         (24K words)     (21K words)

BN Topic                     TDT + TDT2         93 shows             10 shows        6 shows
                             (10.7M words)      (700K words)         (205K words)    (44K words)

In each case we then look at results from combin- 
ing the prosodic information with language model in- 
formation, for both transcribed and recognized words. 
Where possible (i.e., in the sentence segmentation 
tasks), we compare results for the two alternative 
model integration approaches (combined HMM and 
interpolation). In the next two sections, we compare 
results across both tasks and speech corpora. We dis- 
cuss differences in which types of features are helpful 
for a task, as well as differences in the relative reduc- 
tion in error achieved by the different models, using a 
measure that tries to normalize for the inherent diffi- 
culty of each task. Finally, we discuss issues for future 
work. 

3.1 Task 1: Sentence segmentation of Broadcast 
News data 

3.1.1 Prosodic feature usage 

The best-performing tree identified six features for 
this task, which fall into four groups. To summarize 
the relative importance of the features in the decision 
tree we use a measure we call "feature usage", which 
is computed as the relative frequency with which that 
feature or feature class is queried in the decision tree. 



The measure increments for each sample classified 
using that feature; features used higher in the tree 
classify more samples and therefore have higher usage 
values. The feature usage was as follows (by type of 
feature): 

• (46%) Pause duration at boundary 

• (42%) Turn/no turn at boundary 

• (11%) F0 difference across boundary

• (1%) Rhyme duration
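
The "feature usage" measure defined above can be sketched as a count over classified samples. The nested-dictionary tree, the feature names, and the normalization over all queries are assumptions made for illustration; the percentages listed above come from the actual tree and are not reproduced by this toy example.

```python
from collections import Counter
from typing import Dict, List


def feature_usage(tree: dict, samples: List[Dict[str, float]]) -> Dict[str, float]:
    """Relative frequency with which each feature is queried while classifying samples."""
    counts: Counter = Counter()
    for sample in samples:
        node = tree
        while "feature" in node:  # internal node; leaves carry only a probability
            counts[node["feature"]] += 1
            go_left = sample[node["feature"]] < node["threshold"]
            node = node["left"] if go_left else node["right"]
    total = sum(counts.values())
    return {feat: n / total for feat, n in counts.items()}


# Toy tree: the root asks about pause duration; short pauses then ask about F0.
toy_tree = {
    "feature": "pause_dur", "threshold": 0.5,
    "left": {
        "feature": "f0_diff", "threshold": -0.1,
        "left": {"p_boundary": 0.40},
        "right": {"p_boundary": 0.05},
    },
    "right": {"p_boundary": 0.90},
}
toy_samples = [
    {"pause_dur": 0.1, "f0_diff": 0.0},
    {"pause_dur": 0.2, "f0_diff": -0.3},
    {"pause_dur": 0.3, "f0_diff": 0.1},
    {"pause_dur": 0.9, "f0_diff": 0.0},
]
usage = feature_usage(toy_tree, toy_samples)
```

Because every sample passes through the root, the root feature is counted for all four toy samples while the F0 feature is counted only for the three that reach its node, illustrating why features used higher in the tree receive higher usage values.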

The main features queried were pause, turn, and
F0. To understand whether they behaved in the manner
expected based on the descriptive literature, we
inspected the decision tree. The tree for this task had
29 leaves; we show the top portion of it in Fig. 5.

The behavior of the features is precisely that ex- 
pected from the literature. Longer pause durations at 
the boundary imply a higher probability of a sentence 
boundary at that location. Speakers exchange turns al- 
most exclusively at sentence boundaries in this corpus, 
so the presence of a turn boundary implies a sentence 
boundary. The F0 features all behave in the same way,
with lower negative values raising the probability of
a sentence boundary. These features reflect the log of
the ratio of F0 measured within the word (or window)
preceding the boundary to the F0 in the word (or
window) after the boundary. Thus, lower negative values
imply a larger pitch reset at the boundary, consistent 
with what we would expect.

Fig. 5: Top levels of decision tree selected for the Broadcast News sentence segmentation task. Nodes contain
the percentage of "else" and "S" (sentence) boundaries, respectively, and are labeled with the majority class.
PAU_DUR = pause duration, F0s = stylized F0 feature reflecting ratio of speech before the boundary to that after the
boundary, in the log domain.

3.1.2 Error reduction from prosody

Table 2 summarizes the results on both transcribed and
recognized words, for various sentence segmentation 
models for this corpus. The baseline (or "chance") 
performance for true words in this task is 6.2% error, 
obtained by labeling all locations as nonboundaries 
(the most frequent class). For recognized words, it 
is considerably higher; this is due to the non-zero 
lower bound resulting if one accounts for locations in 
which the 1-best hypothesis boundaries do not coin- 
cide with those of the reference alignment. "Lower 
bound" gives the lowest segmentation error rate possi- 
ble given the word boundary mismatches due to recog- 
nition errors. 

Results show that the prosodic model alone per- 
forms better than a word-based language model, de- 
spite the fact that the language model was trained on 
a much larger data set. Furthermore, the prosodic 
model is somewhat more robust to errorful recognizer 
output than the language model, as measured by the 
absolute increase in error rate in each case. Most im- 
portantly, a statistically significant error reduction is 
achieved by combining the prosodic features with the 
lexical features, for both integration methods. The 
relative error reduction is 19% for true words, and 
8.5% for recognized words. This is true even though 
both models contained turn information, thus violat- 
ing the independence assumption made in the model 
combination. 
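
The relative error reductions can be checked directly against the Broadcast News sentence segmentation error rates. The sketch below assumes the comparison is between the language model alone and the best combined model for each condition; the function name is illustrative.

```python
def relative_error_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved model."""
    return (baseline - improved) / baseline


# Error rates (percent) from the Broadcast News sentence segmentation table.
true_words = relative_error_reduction(4.1, 3.3)          # LM only vs. combined HMM
recognized_words = relative_error_reduction(11.8, 10.8)  # LM only vs. interpolation
print(f"{true_words:.1%} {recognized_words:.1%}")
```

The computed values are close to the figures quoted in the text; small differences are expected because the table entries are themselves rounded.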

3.1.3 Performance without F0 features

A question one may ask in using the prosodic features
is how the model would perform without any
F0 features. Unlike pause, turn, and duration information,
the F0 features used are not typically extracted
or computed in most ASR systems. We ran comparison
experiments on all conditions, but removed all
F0 features from the input to the feature selection
algorithm. Results are shown in Table 3, along with the
previous results using all features, for comparison.

As shown, removing the F0 features reduces
model accuracy for prosody alone, for both true
and recognized words. In the case of the true words,
model integration using the no-F0 prosodic tree actually
fares slightly better than that which used all
features, despite similar model combination weights
in the two cases. The effect is only marginally significant
in a Sign test, so it may indicate chance variation.
However, it could also indicate a higher degree of
correlation between true words and the prosodic features
that indicate boundaries, when F0 is included. For
recognized words, by contrast, the model with all prosodic
features is superior to that without the F0 features, both
alone and after integration with the language model.
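
For readers unfamiliar with the Sign test, the following is a generic sketch of a two-sided version. It is not the authors' implementation: the text does not say which units were compared, and the counts used here are purely hypothetical.

```python
from math import comb


def sign_test_p_value(a_better: int, b_better: int) -> float:
    """Two-sided sign test p-value; ties are assumed to be dropped beforehand."""
    n = a_better + b_better
    if n == 0:
        return 1.0
    k = min(a_better, b_better)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical counts of test units on which each system made fewer errors.
print(sign_test_p_value(30, 18))
```

The test looks only at which system wins on each unit, not by how much, which makes it a conservative choice for comparing two segmenters on the same data.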

3.2 Task 2: Sentence segmentation of Switchboard 
data 

3.2.1 Prosodic feature usage 

Switchboard sentence segmentation made use of a 
markedly different distribution of features than ob- 
served for Broadcast News. For Switchboard, the 
best-performing tree found by the feature selection 
algorithm had a feature usage as follows: 

• (49%) Phone and rhyme duration preceding 
boundary 

• (18%) Pause duration at boundary 

• (17%) Turn/no turn at boundary 

• (15%) Pause duration at previous word bound- 
ary 

• (01%) Time elapsed in turn

Clearly, the primary feature type used here is
preboundary duration, a measure that was used only a
scant 1% of the time for the same task in news speech.
Pause duration at the boundary was also useful, but
not to the degree found for Broadcast News.

Of course, it should be noted in comparing feature
usage across corpora and tasks that results here
pertain to comparisons of the most parsimonious,
best-performing model for each corpus and task. That is,
we do not mean to imply that an individual feature
such as preboundary duration is not useful in Broadcast
News, but rather that the minimal and most successful
model for that corpus makes little use of that
feature (because it can make better use of other features).
Thus, it cannot be inferred from these results
that some feature not heavily used in the minimal
model is not helpful. The feature may be useful on
its own; however, it is not as useful as some other
feature(s) made available in this study.¹

Table 2: Results for sentence segmentation on Broadcast News

Model                        Transcribed words   Recognized words
LM only (130M words)                4.1                11.8
Prosody only (700K words)           3.6                10.9
Interpolated                        3.5                10.8
Combined HMM                        3.3                11.7
Chance                              6.2                13.3
Lower bound                         0.0                 7.9

Values are word boundary classification error rates (in percent).

Table 3: Results for sentence segmentation on Broadcast News, with and without F0 features

Model                          Transcribed words   Recognized words
LM only (130M words)                  4.1                11.8
All prosody features:
  Prosody only (700K words)           3.6                10.9
  Prosody+LM: Combined HMM            3.3
  Prosody+LM: Interpolation                              10.8
No F0 features:
  Prosody only (700K words)           3.8                11.3
  Prosody+LM: Combined HMM            3.2
  Prosody+LM: Interpolation                              11.1
Chance                                6.2                13.3
Lower bound                           0.0                 7.9

Values are word boundary classification error rates (in percent). For the integrated ("Prosody + LM") models,
results are given for the optimal model only (combined HMM for true words, interpolation of posteriors for
recognized words).

The two "pause" features are not grouped together,
because they represent fundamentally different phenomena.
The second pause feature essentially captured
the boundaries after one word such as "uh-huh"
and "yeah", which for this work had been marked
as followed by sentence boundaries ("yeah <Sent>
i know what you mean").² The previous pause in
this case was time that the speaker had spent in listening
to the other speaker (channels were recorded
separately and recordings were continuous on both
sides). Since one-word backchannels (acknowledgments
such as "uh-huh") and other short dialogue acts
make up a large percentage of sentence boundaries
in this corpus, the feature is used fairly often. The
turn features also capture similar phenomena related
to turn-taking. The leaf count for this tree was 236, so
we display only the top portion of the tree in Fig. 6.

Pause and turn information, as expected, sug- 
gested sentence boundaries. Most interesting about 
this tree was the consistent behavior of duration fea- 
tures, which gave higher probability to a sentence 
boundary when lengthening of phones or rhymes 
was detected in the word preceding the boundary. 
Although this is in line with descriptive studies of 
prosody, it was rather remarkable to us that duration 
would work at all, given the casual style and speaker 
variation in this corpus, as well as the somewhat noisy 
forced alignments for the prosodic model training. 

¹ One might propose a more thorough investigation by reporting
performance for one feature at a time. However, we found in
examining such results that typically our features required the
presence of one or more additional features in order to be helpful. (For
example, pitch features required the presence of the pause feature.)
Given the large number of features used, the number of potential
combinations becomes too large to report on fully here.

² "Utterance" boundary is probably a better term, but for consistency
we use the term "sentence" boundary for these dialogue act
boundaries as well.

3.2.2 Error reduction from prosody

Unlike the previous results for the same task on
Broadcast News, we see in Table 4 that for Switchboard
data, prosody alone is not a particularly good
model. For transcribed words it is considerably worse
than the language model; however, this difference
is reduced for the case of recognized words (where
the prosody shows less degradation than the language
model).

Table 4: Results for sentence segmentation on Switchboard

Model            Transcribed words   Recognized words
LM only                 4.3                22.8
Prosody only            6.7                22.9
Interpolated            4.1                22.2
Combined HMM            4.0                22.5
Chance                 11.0                25.8
Lower bound             0.0                17.6

Values are word boundary classification error rates
(in percent).

Yet, despite the poor performance of prosody 
alone, combining prosody with the language model 
resulted in a statistically significant improvement over 
the language model alone (7.0% and 2.6% relative for 
true and recognized words, respectively). All dif- 
ferences were statistically significant, including the 
difference in performance between the two model in- 
tegration approaches. Furthermore, the pattern of re- 
sults for model combination approaches observed for 
Broadcast News holds as well: the combined HMM is 
superior for the case of transcribed words, but suffers 
more than the interpolation approach when applied to 
recognized words. 

3.3 Task 3: Topic segmentation of Broadcast News 
data 

3.3.1 Prosodic feature usage 

The feature selection algorithm determined five fea- 
ture types most helpful for this task: 

• (43%) Pause duration at boundary 

• (36%) F0 range

• (09%) Turn/no turn at boundary 

• (07%) Speaker gender 

• (05%) Time elapsed in turn 

Fig. 6: Top levels of decision tree selected for the Switchboard sentence segmentation task. Nodes contain
the percentage of "S" (sentence) and "else" boundaries, respectively, and are labeled with the majority class.
"PAU_DUR"=pause duration, "RHYM"=syllable rhyme. VOWEL, PHONE and RHYME features apply to the
word before the boundary.

The results are somewhat similar to those seen earlier
for sentence segmentation in Broadcast News, in
that pause, turn, and F0 information are the top features.
However, the feature usage here differs considerably
from that for the sentence segmentation task, in
that here we see a much higher use of F0 information.

Furthermore, the most important F0 feature was a
range feature (log ratio of the preceding word's F0 to
the speaker's F0 baseline), which was used 2.5 times
more often in the tree than the F0 feature based on
difference across the boundary. The range feature does
not require information about F0 on the other side of
the boundary; thus, it could be applied regardless of
whether there was a speaker change at that location.
This was a much more important issue for topic
segmentation than for sentence segmentation, since the
percentage of speaker changes is higher in the former
than in the latter.

It should be noted, however, that the importance 
of pause duration is underestimated. As explained 
earlier, pause duration was also used prior to tree 
building, in the chopping process. The decision tree 
was applied only to boundaries exceeding a certain 
duration. Since the duration threshold was found by 
optimizing for the TDT error criterion, which assigns 
greater weight to false alarms than to false rejections, 
the resulting pause threshold is quite high (over half a 
second). Separate experiments using boundaries be- 
low our chopping threshold show that trees distinguish 
much shorter pause durations for segmentation deci- 
sions, implying that prosody could potentially yield 
an even larger relative advantage for error metrics fa- 
voring a shorter chopping threshold. 
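
The effect of the chopping step can be sketched as a simple pre-filter on candidate boundaries. The threshold value, function name, and pause durations below are assumptions made for illustration; the text states only that the optimized threshold was over half a second.

```python
from typing import List, Tuple

# Assumed value: the text says only that the threshold is "over half a second".
CHOP_THRESHOLD_SEC = 0.5


def chop_candidates(pauses: List[Tuple[int, float]], threshold: float = CHOP_THRESHOLD_SEC) -> List[int]:
    """Return indices of word boundaries whose pause exceeds the threshold."""
    return [index for index, pause_sec in pauses if pause_sec > threshold]


# Hypothetical (boundary index, pause duration in seconds) pairs.
pauses = [(0, 0.05), (1, 0.62), (2, 0.30), (3, 1.40)]
print(chop_candidates(pauses))
```

Only the boundaries that survive this filter would reach the decision tree, so the tree never sees short pauses and its usage statistics understate how much work pause duration is doing overall.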

Inspecting the tree in Fig. 7 (the tree has additional
leaves; we show only the top of it), we find that
it is easily interpretable and consistent with prosodic
descriptions of topic or paragraph boundaries. Boundaries
are indicated by longer pauses and by turn information,
as expected. Note that the pause thresholds
are considerably higher than those used for the sentence
tree. This is as expected, because of the larger
units used here, and due to the prior chopping at long
pause boundaries for this task.

Most of the rest of the tree uses F0 information,
in two ways. The most useful F0 range feature,
F0s_LR_MEAN_KBASELN, computes the log of the
ratio of the mean F0 in the last word to the speaker's
estimated F0 baseline. As shown, lower values favor
topic boundaries, which is consistent with speakers
dropping to the bottom of their pitch ranges at the
ends of topic units. The other F0 feature reflects the
height of the last word relative to a speaker's estimated
F0 range; smaller values thus indicate that a speaker is
closer to his or her F0 floor, and as would be predicted,
imply topic boundaries.

The speaker-gender feature was used in the tree
in a pattern that at first suggested to us a potential
problem with our normalizations. It was repeatedly
used immediately after conditioning on the F0 range
feature F0s_LR_MEAN_KBASELN. However, inspection
of the feature value distributions by gender and by
boundary class suggested that this was not a problem
with normalization, as shown in Fig. 8.

As indicated, there was no difference by gender in
the distribution of F0 values for the feature in the case
of boundaries not containing a topic change. After
normalization, both men and women ended nontopic
boundaries in similar regions above their baselines.
Since nontopic boundaries are by far the more frequent
class (distributions in the histogram are normalized),
the majority of boundaries in the data show no difference
on this measure by gender. For topic boundaries,
however, the women in a sense behave more
"neatly" than the men. As a group, the women have a
tighter distribution, ending topics at F0 values that are
centered closely around their F0 baselines. Men, on
the other hand, are as a group somewhat less
"well-behaved" in this regard. They often end topics below
their F0 baselines, and show a wider distribution
(although it should also be noted that since these are
aggregate distributions, the wider distribution for men
could reflect either within-speaker or cross-speaker
variation).

Fig. 7: Top levels of decision tree selected for the Broadcast News topic segmentation task. Nodes contain the
percentage of "else" and "TOPIC" boundaries, respectively, and are labeled with the majority class.

Fig. 8: Normalized distribution of F0 range feature
(F0s_LR_MEAN_KBASELN) for male and female
speakers for topic and nontopic boundaries in Broadcast
News

This difference is unlikely to be due to baseline
estimation problems, since the nontopic distributions
show no difference. The variance difference is also
not explained by a difference in sample size, since that
factor would predict an effect in the opposite direction.
One possible explanation is that men are more likely
than women to produce regions of nonmodal voicing
(such as creak) at the ends of topic boundaries;
this awaits further study. In addition, we noted that
nontopic pauses (i.e., chopping boundaries) are much
more likely to occur in male than in female speech,
a phenomenon that could have several causes. For
example, it could be that male speakers in Broadcast
News are assigned longer topic segments on average,
or that male speakers are more prone to pausing in general,
or that males dominate the spontaneous speech
portions where pausing is naturally more frequent.
This finding, too, awaits further analysis.

Table 5: Results for topic segmentation on Broadcast
News

Model            Transcribed words   Recognized words
LM only               0.1895              0.1897
Prosody only          0.1657              0.1731
Combined HMM          0.1377              0.1438
Chance                0.3                 0.3

Values indicate the TDT weighted segmentation cost
metric.

Table 6: Results for topic segmentation on Broadcast
News

Model                      Transcribed words
LM only                         0.1895
Combined HMM:
  All prosodic features         0.1377
  No F0 features                0.1511
Chance                          0.3

Values indicate the TDT weighted segmentation cost
metric.

3.3.2 Error reduction from prosody 

Table 5 shows results for segmentation into topics in
Broadcast News speech. All results reflect the
word-averaged, weighted error metric used in the TDT-2
evaluations (Doddington, 1998). Chance here corresponds
to outputting the "no boundary" class at all
locations, meaning that the false alarm rate will be
zero, and the miss rate will be 1. Since the TDT metric
assigns a weight of 0.7 to false alarms, and 0.3 to
misses, chance in this case will be 0.3.
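
The chance value follows directly from the weights just given. The sketch below uses only those two weights; the function name is illustrative, and the word-averaged estimation of the miss and false alarm rates in the official evaluation is not reproduced here.

```python
def tdt_segmentation_cost(p_miss: float, p_false_alarm: float,
                          w_miss: float = 0.3, w_false_alarm: float = 0.7) -> float:
    """Weighted sum of miss and false alarm rates, using the weights quoted in the text."""
    return w_miss * p_miss + w_false_alarm * p_false_alarm


# Always predicting "no boundary": every true boundary is missed, no false alarms.
chance = tdt_segmentation_cost(p_miss=1.0, p_false_alarm=0.0)
print(chance)
```

Because false alarms carry the larger weight, a system tuned for this metric is pushed toward proposing few boundaries, which is the same pressure that produced the high pause threshold in the chopping step.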

As shown, the error rate for the prosody model 
alone is lower than that for the language model. Fur- 
thermore, combining the models yields a significant 
improvement. Using the combined model, the er- 
ror rate decreased by 27.3% relative to the language 
model, for the correct words, and by 24.2% for recog- 
nized words. 

3.3.3 Performance without F0 features

As in the earlier case of Broadcast News sentence
segmentation, since this task made use of F0 features,
we asked how well it would fare without any F0 features.
The experiments were conducted only for true
words, since as shown previously in Table 5, results
are similar to those for recognized words. Results, as
shown in Table 6, indicate a significant degradation in
performance when the F0 features are removed.

3.4 Comparisons of error reduction across conditions

To compare performance of the prosodic, language,
and combined models directly across tasks and corpora,
it is necessary to normalize over three sources
of variation. First, our conditions differ in chance
performance (since the percentage of boundaries that
correspond to a sentence or topic change differs across
tasks and corpora). Second, the upper bound on accuracy
in the case of imperfect word recognition depends
on both the word error rate of the recognizer for the
corpus, and the task. Third, the (standard) metric we
have used to evaluate topic boundary detection differs
from the straight accuracy metric used to assess
sentence boundary detection.

A meaningful metric for comparing results directly
across tasks is the percentage of the chance
error that remains after application of the modeling.
This measure takes into account the different chance
values, as well as the ceiling effect on accuracy due to
recognition errors. Thus, a model with a score of 1.0
does no better than chance for that task, since 100% of
the error associated with chance performance remains
after the modeling. A model with a score close to 0.0
is a nearly "perfect" model, since it eliminates nearly
all the chance error. Note that in the case of recognized
words, this amounts to an error rate at the lower
bound rather than at zero.
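
One way to realize this normalization is sketched below. The exact formula is not given in the text, so subtracting the lower bound from both the model error and the chance error is an assumption, chosen because it gives 1.0 at chance and 0.0 at the lower bound as described. The function name is illustrative; the example values are the prosody-only error rates for Broadcast News sentence segmentation.

```python
def chance_error_remaining(error: float, chance: float, lower_bound: float = 0.0) -> float:
    """Share of the reducible chance error left after modeling (assumed formula)."""
    if chance == lower_bound:
        raise ValueError("chance and lower bound coincide; nothing to normalize")
    return (error - lower_bound) / (chance - lower_bound)


# Prosody-only model, Broadcast News sentence segmentation (error rates in percent).
true_words = chance_error_remaining(3.6, chance=6.2)
recognized = chance_error_remaining(10.9, chance=13.3, lower_bound=7.9)
print(round(true_words, 3), round(recognized, 3))
```

Under this reading the two conditions land on a common scale even though their raw error rates differ by a factor of three, which is the purpose of the measure.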

In Fig. 9, performance on the relative error metric
is plotted by task/corpus, reliability of word cues
(ASR or reference transcript), and model. In the case
of the combined model, the plotted value reflects
performance for whichever of the two combination
approaches (HMM or interpolation) yielded best results
for that condition.

Useful cross-condition comparisons can be sum- 
marized. For all tasks and as expected, performance 
suffers for recognized words compared with tran- 
scribed words. For the sentence segmentation tasks, 
the prosodic model degrades less on recognized words 
relative to true words than the word-based models. 
The topic segmentation results based on language 
model information show remarkable robustness to 
recognition errors — much more so than sentence seg- 
mentation. This can be noted by comparing the large 
loss in performance from reference to ASR word cues 
for the language model in the two sentence tasks, to 
the identical performance of reference and ASR words 
in the case of the topic task. The pattern of results can 
be attributed to the different character of the language 
model used. Sentence segmentation uses a higher- 
order N-gram that is sensitive to specific words around 
a potential boundary, whereas topic segmentation is 
based on bag-of-words models that are inherently ro- 
bust to individual word errors. 

Another important finding made visible in Fig. 9 is
that the performance of the language model alone on 
Switchboard transcriptions is unusually good, when 
compared with the performance of the language model 
alone for all other conditions (including the corre- 
sponding condition for Broadcast News). This advan- 
tage for Switchboard completely disappears on recog- 
nized words. While researchers typically have found 
Switchboard a difficult corpus to process, in the case 
of sentence segmentation on true words it is just the 
opposite — atypically easy. Thus, previous work on 
automatic segmentation on Switchboard transcripts 
(Stolcke and Shriberg, 1996) is likely to overestimate 
success for other corpora. The Switchboard sentence 
segmentation advantage is due in large part to the high 
rate of a small number of words that occur sentence- 
initially (especially "I", discourse markers, backchan- 
nels, coordinating conjunctions, and disfluencies). 

Finally, a potentially interesting pattern can be
seen when comparing the two alternative model
combination approaches (integrated HMM, or interpolation)
for the sentence segmentation task.³ Only
the best-performing model combination approach for
each condition (ASR or reference words) is noted in
Fig. 9; however, the complete set of results is
inferrable from Tables 2 and 4. As indicated in the
tables, the same general pattern obtained for both corpora.
The integrated HMM was the better approach
on true words, but it fared relatively poorly on recognized
words. The posterior interpolation, on the
other hand, yielded smaller, but consistent improvements
over the individual knowledge sources on both
true and recognized words. The pattern deserves further
study, but one possible explanation is that the
integrated HMM approach as we have implemented it
assumes that the prosodic features are independent of
the words. Recognition errors, however, will tend to
affect both words (by definition) and prosodic features
through incorrect alignments. This will cause the two
types of observations to be correlated, violating the
independence assumption.
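
As an illustration of the interpolation approach, the sketch below combines two boundary posteriors linearly. A linear form with a single weight is an assumption made here for concreteness, and the function name, weight, and probabilities are invented; none of them are taken from the study.

```python
def interpolate_posteriors(p_lm: float, p_prosody: float, lm_weight: float = 0.5) -> float:
    """Linear interpolation of two posterior estimates for the same boundary."""
    if not 0.0 <= lm_weight <= 1.0:
        raise ValueError("lm_weight must lie in [0, 1]")
    return lm_weight * p_lm + (1.0 - lm_weight) * p_prosody


# Hypothetical posteriors for one word boundary: the language model is unsure,
# the prosodic model (e.g. after a long pause) is confident.
p_boundary = interpolate_posteriors(p_lm=0.35, p_prosody=0.90, lm_weight=0.4)
print(p_boundary, p_boundary > 0.5)
```

Because each model's posterior is computed separately and only then mixed, an interpolation of this kind makes no assumption that words and prosodic features are independent, which may be part of why it degrades more gracefully on recognized words.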

3.5 General discussion and future work 

There are a number of ways in which the studies just 
described could be improved and extended in future 
work. One issue for the prosodic modeling is that 
currently, all of our features come from a small win- 
dow around the potential boundary. It is possible 
that prosodic properties spanning a longer range could 
convey additional useful information. A second likely 
source of improvement would be to utilize information 
about lexical stress and syllable structure in defining 
features (for example, to better predict the domain 
of prefinal lengthening). Third, additional features 
should be investigated; in particular it would be worth- 
while to examine energy-related features if effective 
normalization of channel and speaker characteristics 
could be achieved. Fourth, our decision tree models 
might be improved by using alternative algorithms to 
induce combinations of our basic input features. This 
could result in smaller and/or better-performing trees. 
Finally, as mentioned earlier, testing on recognized
words involved a fundamental mismatch with respect
to model training, where only true words were used.
This mismatch worked against us, since the (fair) testing
on recognized words used prosodic models that
had been optimized for alignments from true words.
Full retraining of all model components on recognized
words would be an ideal (albeit presently expensive)
solution to this problem.

³ The interpolated model combination is not possible for topic
segmentation, as explained earlier.

[Figure 9 panels: BN Sentence, SWB Sentence, BN Topic]

Fig. 9: Percentage of chance error remaining after application of model (allows performance to be directly compared
across tasks). BN=Broadcast News, SWB=Switchboard, ASR=1-best recognition hypothesis, ref=transcribed
words, LM=language model only, Pros=prosody model only, Comb=combination of language and prosody models.

Comparisons between the two speech styles in
terms of prosodic feature usage would benefit from
a study in which factors such as speaker overlap in
train and test data, and the sound quality of recordings,
are more closely controlled across corpora. As
noted earlier, Broadcast News had an advantage over
Switchboard in terms of speaker consistency, since as
is typical in speech recognition evaluations on news
speech, it included speaker overlap in training and
testing. This factor may have contributed to more
robust performance for features dependent on good
speaker normalization, particularly for the F0 features,
which used an estimate of the speaker's baseline
pitch. It is also not yet clear to what extent performance
for certain features is affected by factors such
as recording quality and bandwidth, versus aspects
of the speaking style itself. For example, it is possible
that a high-quality, full-bandwidth recording of
Switchboard-style speech would show a greater use of
prosodic features than found here.

An added area for further study is to adapt prosodic
or language models to the local context. For example,
Broadcast News exhibits an interesting variety of
shows, speakers, speaking styles, and acoustic conditions.
Our current models contain only very minimal
conditioning on these local properties. However,
we have found in other work that tuning the
topic segmenter to the type of broadcast show provided
significant improvement (Tür et al., 2000). The
sentence segmentation task could also benefit from
explicit modeling of speaking style. For example, our
results show that both lexical and prosodic sentence
segmentation cues differ substantially between spontaneous
and planned speech. Finally, results might be
improved by taking advantage of speaker-specific information
(i.e., behaviors or tendencies beyond those
accounted for by the speaker-specific normalizations
included in the prosodic modeling). Initial experiments
suggest we did not have enough training data
per speaker available for an investigation of
speaker-specific modeling; however, this could be made possible
through additional data or the use of smoothing
approaches to adapt global models to speaker-specific
ones.



More sophisticated model combination approaches
that explicitly model interactions of lexical
and prosodic features offer much promise for future
improvements. Two candidate approaches are
the decision trees based on unsupervised hierarchical
word clustering of Heeman and Allen (1997), and
the feature selection approach for exponential models
(Beeferman et al., 1999). As shown in Stolcke
and Shriberg (1996) and similar to Heeman and Allen
(1997), it is likely that the performance of our segmentation
language models would be improved by moving
to an approach based on word classes.

Finally, the approach developed here could be extended
to other languages, as well as to other tasks. As
noted in Section 1.3, prosody is used across languages
to convey information units (e.g., Vaissière, 1983,
among others). While there is broad variation across
languages in the manner in which information related
to item salience (accentuation and prominence) is conveyed,
there are similarities in many of the features
used to convey boundaries. Such universals include
pausing, pitch declination (gradual lowering of F0 valleys
throughout both sentences and paragraphs), and
amplitude and F0 resets at the beginnings of major
units. One could thus potentially extend this approach
to a new language. The prosodic features would differ,
but it is expected that for many languages, similar basic
raw features of pausing, duration, and pitch can be
effective in segmentation tasks. In a similar vein, although
prosodic features depend on the type of events
one is trying to detect, the general approach could be
extended to tasks beyond sentence and topic segmentation
(see, for example, Hakkani-Tür et al., 1999;
Shriberg et al., 1998).

4 Summary and conclusion 

We have studied the use of prosodic information for 
sentence and topic segmentation, both of which are 
important tasks for information extraction and archival 
applications. Prosodic features reflecting pause dura- 
tions, suprasegmental durations, and pitch contours 
were automatically extracted, regularized, and nor- 
malized. They required no hand-labeling of prosody; 
rather, they were based solely on time alignment in- 
formation (either from a forced alignment or from 
recognition hypotheses). 

The features were used as inputs to a decision
tree model, which predicted the appropriate segment
boundary type at each inter-word boundary. We compared
the performance of these prosodic predictors to
that of statistical language models capturing lexical
correlates of segment boundaries, as well as to combined
models integrating both lexical and prosodic
information. Two knowledge source integration approaches
were investigated: one based on interpolating
posterior probability estimators, and the other
using a combined HMM that emitted both lexical and
prosodic observations.

Results showed that on Broadcast News the 
prosodic model alone performed as well as (or even 
better than) purely word-based statistical language 
models, for both true and automatically recognized 
words. The prosodic model achieved comparable per- 
formance with significantly less training data, and of- 
ten degraded less due to recognition errors. Further- 
more, for all tasks and corpora, we obtained a signif- 
icant improvement over word-only models using one 
or both of our combined models. Interestingly, the 
integrated HMM worked best on transcribed words, 
while the posterior interpolation approach was much 
more robust in the case of recognized words. 

Analysis of the prosodic decision trees revealed
that the models capture language-independent boundary
indicators described in the literature, such as
preboundary lengthening, boundary tones, and pitch resets.
Consistent with descriptive work, larger breaks,
such as topics, showed features similar to those of
sentence breaks, but with more pronounced pause and
intonation patterns. Feature usage, however, was corpus
dependent. While features such as pauses were
heavily used in both corpora, we found that pitch 
is a highly informative feature in Broadcast News, 
whereas duration and word cues dominated in Switch- 
board. We conclude that prosody provides rich and 
complementary information to lexical information for 
the detection of sentence and topic boundaries in dif- 
ferent speech styles, and that it can therefore play an 
important role in the automatic segmentation of spo- 
ken language. 